STA6235: Modeling in Regression
y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},
where
\mathbf{Y} is the n \times 1 column vector of responses.
\mathbf{X} is the n \times p design matrix.
\boldsymbol{\beta} is the p \times 1 column vector of regression coefficients.
\boldsymbol{\varepsilon} is the n \times 1 column vector of error terms.
\begin{align*} \mathbf{Y} &= \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \\ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} &= \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n, (p-1)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix} \end{align*}
Recall the error term, \boldsymbol{\varepsilon} = \left[ \varepsilon_1, \ \varepsilon_2, \ \cdots, \ \varepsilon_n\right]^\text{T}, a vector of independent normal random variables with
an n \times 1 expectation vector, \text{E}\left[\boldsymbol{\varepsilon}\right] = \mathbf{0}, and
an n \times n variance-covariance matrix, \text{Var}\left[\boldsymbol{\varepsilon}\right] = \sigma^2 \mathbf{I},
\sigma^2 \mathbf{I} = \sigma^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}
Note that we do not have to assume independence …
\boldsymbol{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{Y}
\mathbf{\hat{Y}} = \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{Y}
\boldsymbol{\hat{\varepsilon}} \equiv \mathbf{e} = \mathbf{Y} - \mathbf{\hat{Y}} = \mathbf{Y} - \mathbf{X}\boldsymbol{\hat{\beta}}
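As a quick illustration (not part of the original notes), the estimator \boldsymbol{\hat{\beta}}, the fitted values, and the residuals can be computed directly with NumPy on simulated data; the data and variable names here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n = 50 observations, p = 3 columns (intercept + 2 predictors)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^{-1} X'Y, solved as a linear system rather than
# forming the inverse explicitly (numerically safer, same answer)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ beta_hat   # fitted values, X beta_hat
e = Y - Y_hat          # residuals, Y - Y_hat
```

A useful sanity check: the residuals are orthogonal to every column of \mathbf{X}, so \mathbf{X}'\mathbf{e} \approx \mathbf{0} up to rounding error.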
\mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n, (p-1)} \end{bmatrix}
Why is there a column of 1’s?
What is p?
Note that for \boldsymbol{\hat{\beta}} to exist (i.e., for \mathbf{X}'\mathbf{X} to be invertible), \mathbf{X} must be of full column rank.
An r \times r matrix is said to be full rank if its rank is r.
This means that the matrix is nonsingular (it has an inverse).
Full rank also means that all columns are linearly independent.
Why am I harping on this?
This is why we have reference groups for categorical predictors!!
If we were to include indicator variables for all c classes alongside the intercept, we would no longer have a full rank design matrix.
This is why, when we do not construct indicator variables ourselves, software “chooses” a reference group for us.
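A small sketch of this rank deficiency (my own hypothetical example, with c = 3 classes and n = 6 observations): the c indicator columns sum to the intercept column, so including all of them makes one column a linear combination of the others.

```python
import numpy as np

# Hypothetical categorical predictor with c = 3 classes
groups = np.array([0, 0, 1, 1, 2, 2])
indicators = np.eye(3)[groups]      # one indicator column per class
intercept = np.ones((6, 1))

# Intercept + all c indicator columns: the indicators sum to the
# intercept column, so the 4-column matrix has rank 3.
X_bad = np.hstack([intercept, indicators])
print(np.linalg.matrix_rank(X_bad))   # 3, not 4

# Dropping one class (the reference group) restores full column rank.
X_good = np.hstack([intercept, indicators[:, 1:]])
print(np.linalg.matrix_rank(X_good))  # 3 = number of columns
```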
\begin{align*} \mathbf{\hat{Y}} &= \mathbf{X} \boldsymbol{\hat{\beta}} \\ &= \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{Y} \\ &= \left( \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' \right) \mathbf{Y} \end{align*}
We call \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' the hat matrix.
Note that the hat matrix is symmetric and idempotent (i.e., \mathbf{H}\mathbf{H} = \mathbf{H}).
\begin{align*} \mathbf{e} &= \mathbf{Y} - \mathbf{\hat{Y}} \\ &= \mathbf{Y} - \mathbf{H}\mathbf{Y} \\ &= \left(\mathbf{I} - \mathbf{H} \right) \mathbf{Y} \end{align*}
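The stated properties of \mathbf{H} are easy to verify numerically; the following is a minimal sketch on simulated data (my own example, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))        # symmetric: True
print(np.allclose(H @ H, H))      # idempotent: True

# Residuals via (I - H) Y agree with Y - H Y
e = (np.eye(n) - H) @ Y
print(np.allclose(e, Y - H @ Y))  # True
```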
Note that we can go through the matrix-based calculations for everything we’ve learned so far, but that is not the point of this course.
Goal: give you an appreciation/understanding of what’s going on under the hood.
If you are interested, you can read more here:
Now, let’s talk about leaving the normal distribution…
We will now extend to generalized linear models (GzLM).
GzLMs have three components:
Random component: \mathbf{Y}, the response.
Linear predictor: \mathbf{X}\boldsymbol{\beta}, where \boldsymbol{\beta} is the parameter vector and \mathbf{X} is the n \times p design matrix that contains p-1 explanatory variables for the n observations.
Link function: g\left(\text{E}[\mathbf{Y}]\right) = \mathbf{X}\boldsymbol{\beta}
The random component consists of the response, y, where
the observations y_1, \ldots, y_n are independent, and
y’s probability density (or mass) function is a member of the exponential family.
A random variable, y, has a distribution in the exponential family if it can be written in the following form: f(y | \theta, \phi) = \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\},
for specified functions a(\cdot), b(\cdot), and c(\cdot).
\theta is the parameter of interest (canonical or natural)
\phi is usually a nuisance parameter (scale or dispersion)
Let’s show that the normal distribution is in the exponential family.
Assume y \sim N(\mu, \sigma^2); we will treat \sigma^2 as a nuisance parameter.
\begin{align*} f(y | \theta, \phi) &= \left(\frac{1}{2\pi \sigma^2} \right)^{1/2} \exp\left\{ \frac{-1}{2\sigma^2} (y-\mu)^2 \right\} \\ &= \exp\left\{ \frac{-y^2}{2\sigma^2} + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\} \end{align*}
\begin{align*} f(y | \theta, \phi) &= \exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\} \end{align*}
\theta = \mu
\phi = \sigma^2
a(\phi) = \phi
b(\theta) = \mu^2/2 = \theta^2/2
c(y|\phi) = -(1/2 \ln(2\pi\sigma^2) + y^2/(2\sigma^2))
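The identification above can be checked numerically: evaluating the usual N(\mu, \sigma^2) density and the exponential-family form with these choices of \theta, \phi, a(\cdot), b(\cdot), and c(\cdot) gives the same value. A minimal sketch (function names are my own):

```python
import math

def normal_pdf(y, mu, sigma2):
    # Standard N(mu, sigma^2) density
    return (1.0 / math.sqrt(2 * math.pi * sigma2)) * math.exp(-(y - mu) ** 2 / (2 * sigma2))

def expfam_pdf(y, mu, sigma2):
    # Exponential-family form with theta = mu, phi = sigma^2,
    # a(phi) = phi, b(theta) = theta^2 / 2,
    # c(y|phi) = -( (1/2) ln(2 pi phi) + y^2 / (2 phi) )
    theta, phi = mu, sigma2
    b = theta ** 2 / 2
    c = -(0.5 * math.log(2 * math.pi * phi) + y ** 2 / (2 * phi))
    return math.exp((y * theta - b) / phi + c)

# The two expressions agree for any y, mu, and sigma^2 > 0
print(abs(normal_pdf(1.3, 0.5, 2.0) - expfam_pdf(1.3, 0.5, 2.0)) < 1e-12)  # True
```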
Why are we restricted to the exponential family?
There are general expressions for model likelihood equations.
There are asymptotic distributions of estimators for model parameters (i.e., \hat{\beta}).
We have an algorithm for fitting models.
Let us first discuss notation.
x_{ij} is the jth explanatory variable for observation i
i=1, ... , n
j=1, ..., p-1
Each observation (or subject) has a vector, x_i = \left[1, x_{i,1}, ..., x_{i, p-1}\right]
The leading 1 is for the intercept, \beta_0.
These are put together into \mathbf{X}, the design matrix.
The linear predictor relates parameters (\eta_i) pertaining to E[y_i] to the explanatory variables using a linear combination, \eta_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}
In matrix form, \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta}
This is a linear predictor because it is linear in the parameters, \boldsymbol{\beta}.
The predictors can be nonlinear.
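To make the last point concrete, here is a small sketch (my own example) where the predictor enters nonlinearly through x and x^2, yet \boldsymbol{\eta} is still linear in \boldsymbol{\beta}:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=40)

# x appears nonlinearly (x and x^2), but eta is a linear
# combination of the betas, so this is still a linear predictor.
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta = np.array([1.0, -0.5, 0.25])
eta = X @ beta   # eta_i = beta_0 + beta_1 x_i + beta_2 x_i^2
```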
We call \mathbf{X} the design matrix.
There are n rows (one for each observation).
There are p columns (a leading column of 1’s for the intercept plus p-1 predictor columns).
In most studies, we have p < n.
Some studies have p \gg n, e.g., genetics.
GzLMs treat y_i as random, x_i as fixed.
These are called fixed-effects models.
There are also random effects \to fixed effects + random effects = mixed models.
The link function, g(\cdot), connects the random component with the linear predictor.
Let \mu_i = \text{E}[y_i]; the link function relates \mu_i to the linear predictor via g(\mu_i) = \eta_i.
In the exponential family representation, \theta serves as the natural parameter.
The link function that transforms \mu_i to the natural parameter, \theta, is called the canonical link.
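As a preview, the canonical links for three common exponential-family members are the identity (normal), log (Poisson), and logit (Bernoulli). A minimal sketch of these as functions of \mu (the dictionary below is my own illustration):

```python
import math

# Canonical links g(mu) = theta for common exponential-family members:
#   Normal:    theta = mu              -> identity link
#   Poisson:   theta = log(mu)        -> log link
#   Bernoulli: theta = log(mu/(1-mu)) -> logit link
canonical_link = {
    "normal": lambda mu: mu,
    "poisson": lambda mu: math.log(mu),
    "bernoulli": lambda mu: math.log(mu / (1 - mu)),
}

print(canonical_link["normal"](2.5))       # 2.5
print(canonical_link["bernoulli"](0.5))    # 0.0
```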
Today we have talked about the underlying matrix algebra in regression analysis.
We also introduced generalized linear models (GzLMs).
Moving forward, we will be learning how to construct regression models using the following distributions:
Note that this is just the beginning of modeling.
There are more complex models that are beyond the scope of this course!